ECC memory

Error-correcting code memory (ECC memory) is a type of computer data storage that can detect and correct the more common kinds of internal data corruption. ECC memory is used in most computers where data corruption cannot be tolerated under any circumstances, such as those used for scientific or financial computing and as servers.

ECC memory maintains a memory system effectively free from single-bit errors: the data read from each word is always the same as the data that had been written to it, even if one of the bits actually stored (or more, in some cases) has been flipped to the wrong state. Some non-ECC memory with parity support allows errors to be detected but not corrected; otherwise, errors that occur are not detected.

History

Electrical or magnetic interference inside a computer system can cause a single bit of dynamic random-access memory (DRAM) to spontaneously flip to the opposite state. It was initially thought that this was mainly due to alpha particles emitted by contaminants in chip packaging material, but research[1] has shown that the majority of one-off ("soft") errors in DRAM chips occur as a result of background radiation, chiefly neutrons from cosmic ray secondaries, which may change the contents of one or more memory cells or interfere with the circuitry used to read or write them. There was some concern that as DRAM density increased, so that the components on DRAM chips became smaller while operating voltages continued to fall, DRAM chips would be affected by such radiation more frequently, since lower-energy particles would be able to change a memory cell's state. On the other hand, smaller cells make smaller targets, and moves to technologies such as SOI may make individual cells less susceptible and so counteract, or even reverse, this trend. Recent studies[2] show that single event upsets due to cosmic radiation have been dropping dramatically with process geometry, and that previous concerns over increasing bit cell error rates are unfounded.

Several approaches have been developed to deal with these unwanted bit-flips.

This problem can be mitigated by using DRAM modules that include extra memory bits and memory controllers that exploit these bits. These extra bits are used to record parity or to hold the check bits of an error-correcting code (ECC). Parity allows the detection of all single-bit errors (actually, any odd number of wrong bits). The most common error-correcting code, a single-error-correcting, double-error-detecting (SECDED) Hamming code, allows a single-bit error to be corrected and (in the usual configuration, with an extra parity bit) double-bit errors to be detected.
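The SECDED behaviour described above can be illustrated with a small sketch. It uses a 4-bit data word protected by a Hamming(7,4) code plus one overall parity bit (8 bits in total), chosen only to keep the example short; memory modules protect much wider words, and the function names `secded_encode` and `secded_decode` are hypothetical, not taken from any real controller.

```python
def parity(bits):
    """Return 1 if the number of 1 bits is odd, else 0."""
    p = 0
    for b in bits:
        p ^= b
    return p


def secded_encode(data):
    """Encode 4 data bits as an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit. Indexes 1..7 hold a
    Hamming(7,4) codeword with check bits at positions 1, 2 and 4.
    """
    d1, d2, d3, d4 = data
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d1, d2, d3, d4
    code[1] = d1 ^ d2 ^ d4  # checks positions 1, 3, 5, 7
    code[2] = d1 ^ d3 ^ d4  # checks positions 2, 3, 6, 7
    code[4] = d2 ^ d3 ^ d4  # checks positions 4, 5, 6, 7
    code[0] = parity(code[1:])  # makes the total number of 1s even
    return code


def secded_decode(code):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits;
    # it is 0 for a valid codeword and equals the position of a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    overall = parity(code)  # 0 while the total parity is still even
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd parity: treat as one flipped bit at index `syndrome`
        # (index 0 means the overall parity bit itself was hit).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits were flipped.
        return None, "uncorrectable"
    return [code[3], code[5], code[6], code[7]], status


if __name__ == "__main__":
    word = secded_encode([1, 0, 1, 1])
    word[6] ^= 1  # simulate a single-bit upset
    print(secded_decode(word))
    word[2] ^= 1  # a second upset in the same word
    print(secded_decode(word))
```

With one flipped bit the decoder returns the original data and reports a correction; with two flipped bits it reports the word as uncorrectable rather than returning wrong data.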

Seymour Cray famously said "parity is for farmers" when asked why he left this out of the CDC 6600.[3] He included parity in the CDC 7600. The original IBM PC and all PCs until the early 1990s used parity checking.[4] Later ones mostly did not. Wider memory buses make parity and especially ECC more affordable. Many current microprocessor memory controllers, including almost all AMD 64-bit offerings, support ECC, but many motherboards and in particular those using low-end chipsets do not.

An ECC-capable memory controller as used in many modern PCs can typically detect and correct errors of a single bit per 64-bit "word" (the unit of bus transfer), and detect (but not correct) errors of two bits per 64-bit word. Some systems also 'scrub' the errors, by writing the corrected version back to memory. The BIOS in some computers, and operating systems such as Linux, allow counting of detected and corrected memory errors, in part to help identify failing memory modules before the problem becomes catastrophic.
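As a rough illustration of how such counts might be read on Linux, the sketch below sums the corrected and uncorrected error counters that the kernel's EDAC subsystem exposes under sysfs. It assumes the usual layout `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count`, which is only present when an EDAC driver for the memory controller is loaded; the function name `read_edac_counts` is hypothetical.

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {'corrected': n, 'uncorrected': n}}.

    An empty dict is returned when the directory does not exist,
    for example when no EDAC driver is loaded (assumed layout).
    """
    counts = {}
    base = Path(root)
    if not base.is_dir():
        return counts
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text().strip())
            ue = int((mc / "ue_count").read_text().strip())
        except (OSError, ValueError):
            continue  # skip controllers with missing or unreadable counters
        counts[mc.name] = {"corrected": ce, "uncorrected": ue}
    return counts


if __name__ == "__main__":
    for name, c in read_edac_counts().items():
        print(name, c)
```

A steadily rising corrected-error count on one controller is the kind of signal that can point to a failing module before uncorrectable errors appear.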

Machines intended for server use typically support ECC. Most mainboards intended for desktop (rather than server) machines do not support ECC at all, as can be seen by examining computer and motherboard specifications. Those that do support ECC are shipped with it disabled (but easily enabled, usually by a change to BIOS setup) to allow use of cheaper non-ECC memory, and are often supplied with memory modules that do not support ECC. It may be that most users opt for non-ECC systems and memory even when ECC is available. The most important reasons for this are:[5][6]

Error detection and correction depend on an expectation of the kinds of errors that occur. Implicitly, we have assumed that the failure of each bit in a word of memory is independent and hence that two simultaneous errors are improbable. This used to be the case when memory chips were one bit wide (typical in the first half of the 1980s). Now many bits of a word are stored in the same chip, so a single chip failure can corrupt several bits of that word at once. This weakness does not seem to be widely addressed; one exception is Chipkill.

Work published between 2007 and 2009 showed widely varying error rates, with over 7 orders of magnitude difference, ranging from 10⁻¹⁰ to 10⁻¹⁷ error/bit·h: roughly one bit error per hour per gigabyte of memory to one bit error per millennium per gigabyte of memory.[2][8][9] A very large-scale study based on Google's very large number of servers was presented at the SIGMETRICS/Performance '09 conference.[8] The actual error rate found was several orders of magnitude higher than in previous small-scale or laboratory studies, with 25,000 to 70,000 errors per billion device hours per megabit (about 2.5–7×10⁻¹¹ error/bit·h), and more than 8% of DIMM memory modules affected by errors per year.
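The unit conversions behind these figures are easy to get wrong, so a short sketch may help. It assumes a gigabyte of 8 × 2³⁰ bits, a megabit of 10⁶ bits and a year of 365.25 days; the helper names are hypothetical.

```python
BITS_PER_GIGABYTE = 8 * 2**30      # assumption: binary gigabyte
BITS_PER_MEGABIT = 10**6           # assumption: decimal megabit
HOURS_PER_YEAR = 24 * 365.25


def errors_per_gigabyte_hour(rate_per_bit_hour):
    """Expected errors per hour in one gigabyte at a given per-bit rate."""
    return rate_per_bit_hour * BITS_PER_GIGABYTE


def years_between_errors(rate_per_bit_hour):
    """Mean time between errors in one gigabyte, expressed in years."""
    return 1.0 / (errors_per_gigabyte_hour(rate_per_bit_hour) * HOURS_PER_YEAR)


def rate_from_errors_per_billion_hours_per_mbit(errors):
    """Convert 'errors per billion device hours per megabit' to error/bit·h."""
    return errors / 1e9 / BITS_PER_MEGABIT


if __name__ == "__main__":
    print(errors_per_gigabyte_hour(1e-10))   # a little under one error per hour
    print(years_between_errors(1e-17))       # on the order of a thousand years
    print(rate_from_errors_per_billion_hours_per_mbit(25_000))
    print(rate_from_errors_per_billion_hours_per_mbit(70_000))
```

Under these assumptions a rate of 10⁻¹⁰ error/bit·h gives a little under one error per gigabyte per hour, and 10⁻¹⁷ error/bit·h gives roughly one error per gigabyte every 1,300 years.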

In most computers used for serious scientific or financial computing and as servers, ECC is the rule rather than the exception, as can be seen by examining manufacturers' specifications.

DRAM may provide increased protection against soft errors by relying on error-correcting codes. Such error-correcting memory, known as ECC or EDAC-protected memory, is particularly desirable for highly fault-tolerant applications, such as servers, as well as for deep-space applications, where radiation levels are higher.

Error-correcting memory controllers traditionally use Hamming codes, although some use triple modular redundancy.

Interleaving distributes the effect of a single cosmic ray, which may upset multiple physically neighboring bits, across multiple words by assigning neighboring bits to different words. As long as a single event upset (SEU) does not exceed the error threshold (e.g., a single error) in any particular word between accesses, it can be corrected (e.g., by a single-bit error-correcting code), and an effectively error-free memory system may be maintained.[10]
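The effect of interleaving can be shown with a toy model. The sketch below assumes one row of memory cells holding several words, and compares a contiguous layout (all bits of a word side by side) with an interleaved layout (neighboring cells belong to different words). The layout functions and parameter names are illustrative assumptions, not a description of any particular chip.

```python
def word_of_cell_contiguous(cell, bits_per_word):
    """Contiguous layout: each word occupies a run of adjacent cells."""
    return cell // bits_per_word


def word_of_cell_interleaved(cell, n_words):
    """Interleaved layout: adjacent cells cycle through the words."""
    return cell % n_words


def flips_per_word(burst_start, burst_len, word_of_cell):
    """Count how many flipped cells land in each word for one burst."""
    hits = {}
    for cell in range(burst_start, burst_start + burst_len):
        word = word_of_cell(cell)
        hits[word] = hits.get(word, 0) + 1
    return hits


if __name__ == "__main__":
    n_words, bits_per_word = 4, 8
    burst = (5, 4)  # one particle upsets cells 5, 6, 7 and 8

    contiguous = flips_per_word(
        *burst, lambda c: word_of_cell_contiguous(c, bits_per_word))
    interleaved = flips_per_word(
        *burst, lambda c: word_of_cell_interleaved(c, n_words))

    print("contiguous :", contiguous)   # several flips fall in one word
    print("interleaved:", interleaved)  # at most one flip per word
```

In this model, a burst no longer than the number of interleaved words never puts more than one flipped bit in any word, so a single-bit error-correcting code can repair every affected word.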

Some ECC memory uses triple modular redundancy hardware (rather than the more common Hamming code), because triple modular redundancy hardware is faster than Hamming error correction hardware.[10] Space satellite systems often use TMR,[11][12][13] although satellite RAM usually uses Hamming error correction.[14]
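Triple modular redundancy can be sketched as a bitwise majority vote over three stored copies of each word. This is a minimal software illustration of the idea, not a model of any specific hardware; the function name `tmr_vote` is hypothetical.

```python
def tmr_vote(a, b, c):
    """Bitwise majority of three copies: each output bit takes the value
    held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)


if __name__ == "__main__":
    original = 0b10110010
    copy_a = original
    copy_b = original ^ 0b00001000  # one copy suffers a bit flip
    copy_c = original
    print(bin(tmr_vote(copy_a, copy_b, copy_c)))
```

The vote needs only a few gates per bit and no syndrome decoding, which is consistent with the speed advantage mentioned above, at the cost of storing every word three times. It fails only when the same bit position is wrong in two of the three copies.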

Effects of memory corruption

The consequence of a memory error is system-dependent. In systems without ECC, an error can lead either to a crash or to corruption of data; in large-scale production sites, memory errors are one of the most common hardware causes of machine crashes.[8] Memory errors can cause security vulnerabilities.[8] A memory error can have no consequences if it changes a bit which neither causes observable malfunctioning nor affects data used in calculations or saved.

As an example, consider a single-bit error that would be ignored by a system with no error-checking, would halt a machine with parity checking, or would be invisibly corrected by ECC. A single bit is stuck at 1 due to a faulty chip, or becomes changed to 1 due to background or cosmic radiation. A spreadsheet storing numbers in ASCII format is loaded, and the digit "8" is stored in the byte which contains the stuck bit as its eighth (least significant) bit; then a change is made to the spreadsheet and it is saved. Without error checking, the "8" (00111000 binary) has silently become a "9" (00111001).
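The spreadsheet example can be reproduced in a few lines. The sketch below forces the least significant bit of the ASCII code for "8" to 1 and shows that a stored even-parity bit no longer matches, which is how a parity-checking system would notice the fault. The helper name `parity_bit` is hypothetical.

```python
def parity_bit(byte):
    """Even-parity bit for one byte: 1 if the byte has an odd number of 1s."""
    return bin(byte).count("1") % 2


if __name__ == "__main__":
    stored = ord("8")                   # 0b00111000
    stored_parity = parity_bit(stored)  # written alongside the byte

    corrupted = stored | 0b00000001     # least significant bit stuck at 1

    print(format(stored, "08b"), "->", format(corrupted, "08b"))
    print(repr(chr(stored)), "->", repr(chr(corrupted)))
    print("parity mismatch:", parity_bit(corrupted) != stored_parity)
```

Without the parity check, nothing distinguishes the corrupted byte from a legitimately stored "9".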

References

  1. Single Event Upset at Ground Level, Eugene Normand, Member, IEEE, Boeing Defense & Space Group, Seattle, WA 98124-2499
  2. Borucki, "Comparison of Accelerated DRAM Soft Error Rates Measured at Component and System Level", 46th Annual International Reliability Physics Symposium, Phoenix, 2008, pp. 482–487
  3. "CDC 6600". Research.microsoft.com. http://research.microsoft.com/~GBell/craytalk/sld047.htm. Retrieved 2011-11-23.
  4. "Parity Checking". Pcguide.com. 2001-04-17. http://www.pcguide.com/ref/ram/errChecking-c.html. Retrieved 2011-11-23.
  5. "pcguide; The Market's Change from Parity to Non-Parity Memory". Pcguide.com. 2001-04-17. http://www.pcguide.com/ref/ram/errMarket-c.html. Retrieved 2011-11-23.
  6. pcguide: Parity vs. Non-Parity: Pros and Cons
  7. "Discussion of ECC on pcguide". Pcguide.com. 2001-04-17. http://www.pcguide.com/ref/ram/errECC-c.html. Retrieved 2011-11-23.
  8. http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
  9. http://www.ece.rochester.edu/~xinli/usenix07/
  10. "Using StrongArm SA-1110 in the On-Board Computer of Nanosatellite". Tsinghua Space Center, Tsinghua University, Beijing. http://www.apmcsta.org/File/doc/Conferences/6th%20meeting/Chen%20Zhenyu.doc. Retrieved 2009-02-16.
  11. "Actel engineers use triple-module redundancy in new rad-hard FPGA". Military & Aerospace Electronics. http://mae.pennnet.com/Articles/Article_Display.cfm?ARTICLE_ID=111934. Retrieved 2009-02-16.
  12. "SEU Hardening of Field Programmable Gate Arrays (FPGAs) For Space Applications and Device Characterization". Klabs.org. 2010-02-03. http://klabs.org/richcontent/Papers/Synopses/nsrec94.htm. Retrieved 2011-11-23.
  13. "FPGAs in Space". Techfocusmedia.net. http://www.techfocusmedia.net/fpgajournal/feature_articles/20040803_space.htm. Retrieved 2011-11-23.
  14. "Commercial Microelectronics Technologies for Applications in the Satellite Radiation Environment". Radhome.gsfc.nasa.gov. http://radhome.gsfc.nasa.gov/radhome/papers/aspen.htm. Retrieved 2011-11-23.